Read: OpenIntro Statistics Sections 1.2.3, 1.2.4, and 1.2.5
Review Vocabulary: (next slide)
Vocabulary for relationships
Dependent / Associated / Correlated: Some statistical relationship exists
Positive Correlation: Higher x values linked to higher y values
Negative Correlation: Higher x values linked to lower y values
Independent: The variables are unrelated. No relationship exists.
Explanatory and Response variables: The explanatory variable is the one that causes (or is expected to cause) an effect; the response variable is the one that is (or is expected to be) affected.
Correlation Coefficient \(\rho\): A numerical measure of the strength and direction of correlation between two numerical variables. In pandas, it is the output of df.corr().
Using vocabulary / testing understanding
Based on your best understanding or best guess…
Among students, weekly study time and GPA are likely _____
Among mammal species, heart rate and mass are likely _____
Among American football players, #tackles and #rushing yards are likely _____
Among people, eye color and ice cream preference are likely _____
Among Americans, income and political party affiliation are likely _____
Year by year, global average temperatures and the Dow Jones Industrial Average are likely _____
Correlation Coefficient
Just how correlated are two numerical variables?
The correlation coefficient \(\rho\) measures strength and direction of correlation.
Use df.corr(numeric_only=True) to calculate.
\(\rho\) nearly 1: x and y are strongly positively correlated.
\(\rho\) nearly 0: x and y have little or no linear correlation. They may be independent, or they may be related in a nonlinear way.
\(\rho\) nearly -1: x and y are strongly negatively correlated.
Correlations quickly flag trends and interrelationships for further study.
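To make the three cases concrete, here is a minimal sketch using a small made-up table (the `toy` DataFrame and its column names are hypothetical, not from the course data). It includes one column that depends entirely on x yet has a correlation coefficient of about 0, because the dependence is not linear.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
x = np.arange(-10, 11)                 # -10, -9, ..., 10
toy = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,                   # rises exactly linearly with x
    "down": -3 * x + 5,                # falls exactly linearly with x
    "squared": x ** 2,                 # determined by x, but not linearly
    "label": ["a"] * len(x),           # non-numeric column, skipped by numeric_only
})

# Table of pairwise correlation coefficients for the numeric columns
corr = toy.corr(numeric_only=True)
print(corr.round(2))

rho_up = corr.loc["x", "up"]           # close to 1
rho_down = corr.loc["x", "down"]       # close to -1
rho_squared = corr.loc["x", "squared"] # close to 0, despite total dependence
```

The `squared` column is why a value of \(\rho\) near 0 should be read as "no linear trend" rather than as proof of independence.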
Scatterplots and correlations
A scatterplot is a 2D graph that depicts each data case as a point, positioned along the \(x\)-axis according to one variable and along the \(y\)-axis according to another. Like most graphs, scatterplots primarily target sighted users.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("../openintro.csv/cdc.samp.csv")

# Create plot with clear labels
sns.regplot(data=df, x='weight', y='wtdesire')
plt.xlabel("Actual Weight (lbs)")
plt.ylabel("Desired Weight (lbs)")
plt.title(r"Actual vs. Desired Weight ($\rho = 0.74$)")
plt.show()
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("../openintro.csv/cdc.samp.csv")

# Create plot with clear labels
sns.scatterplot(data=df, x='age', y='weight')
plt.xlabel("Age (years)")
plt.ylabel("Weight (lbs)")
plt.title(r"Age vs. Weight ($\rho = 0.09$)")
plt.show()
Figure 2: Scatterplot of age vs. weight, showing a very weak correlation (\(\rho = 0.09\))
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("../openintro.csv/cdc.samp.csv")

# Create plot with clear labels (height in this dataset is recorded in inches)
sns.regplot(data=df, x='age', y='height')
plt.xlabel("Age (years)")
plt.ylabel("Height (inches)")
plt.title(r"Age vs. Height ($\rho = -0.30$)")
plt.show()
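The plot titles in this section hard-code their \(\rho\) values. One possible alternative, sketched here, is to compute the coefficient from the data and format it into the title. The helper name `rho_title` and the `demo` data are hypothetical; the demo values are randomly generated stand-ins, not the CDC sample.

```python
import numpy as np
import pandas as pd

def rho_title(df, x, y, label):
    """Build a plot title reporting the computed correlation of columns x and y."""
    rho = df[x].corr(df[y])            # correlation between two columns
    return rf"{label} ($\rho = {rho:.2f}$)"

# Hypothetical stand-in data with column names borrowed from the slides
rng = np.random.default_rng(0)
weight = rng.normal(170, 30, 200)
demo = pd.DataFrame({
    "weight": weight,
    "wtdesire": 0.6 * weight + rng.normal(60, 10, 200),
})

title = rho_title(demo, "weight", "wtdesire", "Actual vs. Desired Weight")
print(title)
```

The resulting string can be passed to plt.title(...) in place of a hand-typed value, so the title stays in step with the data if the file changes.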
Sir Ronald Fisher, British statistician and geneticist, introduced his now famous Iris data in 1936, with 150 cases involving three very similar species: Iris setosa, Iris versicolor, and Iris virginica.
His dataset is often used as the “hello world” of data exploration.
Correlations can change when data is regrouped.
A negative correlation
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

iris = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
sns.lmplot(data=iris, x='petal_width', y='sepal_width')
plt.show()
Scatterplot showing a weak negative correlation between petal width and sepal width
Three positive correlations
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

iris = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
sns.lmplot(data=iris, x='petal_width', y='sepal_width', hue="species")
plt.show()
Scatterplot showing three strong positive correlations between petal width and sepal width, one per species
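The same reversal can be checked numerically. The sketch below uses made-up data (the group names, centers, and noise levels are all hypothetical) so that it runs without downloading anything. It computes the correlation once over all cases and once within each group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frames = []
# Three hypothetical groups: within each group y rises with x,
# but groups centered at larger x sit at lower y.
for name, x_center, y_center in [("A", 1.0, 9.0), ("B", 4.0, 6.0), ("C", 7.0, 3.0)]:
    x = x_center + rng.uniform(-1, 1, 50)
    y = y_center + (x - x_center) + rng.normal(0, 0.2, 50)
    frames.append(pd.DataFrame({"group": name, "x": x, "y": y}))
data = pd.concat(frames, ignore_index=True)

# Correlation over all cases, ignoring the grouping
overall = data["x"].corr(data["y"])

# Correlation computed separately inside each group
by_group = {name: g["x"].corr(g["y"]) for name, g in data.groupby("group")}

print(f"overall: {overall:.2f}")
for name, rho in by_group.items():
    print(f"group {name}: {rho:.2f}")
```

With these settings the overall coefficient comes out negative while each within-group coefficient is positive. The same pattern can be explored on the iris data by grouping on its species column.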
Types of statistical association
Matt says “Education level and political party affiliation are positively associated.” Explain why Matt must be wrong, regardless of politics.
Fiona says “Learning piano has no bearing on hair color, and hair color does not influence piano interest or ability. Therefore hair color and piano skill are statistically independent.” Find the flaw in her logic.
Individual work
Solve five problems on WeBWorK under “2. Association and Correlation.”
As always, find the preparatory work for the next slide deck and do it before class.